Various models of Markov-renewal programming were presented in [11], referred to
hereafter as MRP I and MRP II; related models were presented independently by
de Cani, Howard [10], and Schweitzer [15]. Since that time, new results have been
obtained which clarify certain questions raised in these papers. Since these results
are either "in the folklore" or available only in scattered unpublished reports,
it seemed worthwhile to gather them together in one article.

We shall consider only the finite-state, finite-alternative-space, infinite-horizon,
undiscounted and discounted models discussed there.


LINEAR  PROGRAMMING  FORMULATION  AND  RESULTS 




It  is  well-known  that  Markov  programs  [9]  can  be  represented  as  linear  programs. 
The  first  such  formulations  are  apparently  due  to  Oliver  [14],  Manne  [13],  D’Epenoux 
[3],  de  Ghellinck  [7],  Wolfe  and  Dantzig  [17],  and  Derman  [4].  The  extension  of 
these  formulations  to  Markov-renewal  programs  is  straightforward,  and  the  resulting 
primal  and  dual  programs  for  both  the  undiscounted  and  discounted  infinite-horizon 
cases  are  given  in  matrix  form  by  Howard  [10].  Because  some  of  the  details  in  the 
interpretation  of  the  dual  programs  are  not  given,  we  consider  these  formulations, 
using  the  notation  of  [11]. 

In  the  case  of  undiscounted  rewards,  the  primal  problem  is: 


$$
\begin{aligned}
\text{Minimize} \quad & x_0 \\
\text{Subject to:} \quad & \sum_{j=1}^{N} (\delta_{ij} - p_{ij}^z)\, x_j + \nu_i^z x_0 \ge \rho_i^z && (i = 1, 2, \ldots, N) \\
& x_j \ \text{unrestricted} && (j = 0, 1, \ldots, N)
\end{aligned}
\tag{1}
$$

where $z = z(i)$ varies over all alternatives available in that state. (Usually we
set $x_N \equiv 0$.) At optimality, $x_0$ equals the maximal value of the gain rate, $g$,
and the $x_i$ equal the corresponding relative values

$$
x_i = v_i = w_i - w_N \qquad (i = 1, 2, \ldots, N) \tag{2}
$$

in the limiting form of the total reward over $[0,t]$:

$$
v_i(t) = gt + w_i + o(1) \qquad (t \to \infty) \tag{3}
$$

(We assume an ergodic underlying Markov chain. See (2), (B.6), (B.7), and (12) in MRP II,
and (100) and (104) of [10].)


The dual to (1), after dividing by $\nu_i^z$, is:

$$
\begin{aligned}
\text{Maximize} \quad & \sum_i \sum_z y_i^z \frac{\rho_i^z}{\nu_i^z} \\
\text{Subject to:} \quad & \sum_i \sum_z y_i^z \frac{\delta_{ij} - p_{ij}^z}{\nu_i^z} = 0 && (j = 1, 2, \ldots, N) \\
& \sum_i \sum_z y_i^z = 1 \\
& y_i^z \ge 0
\end{aligned}
\tag{4}
$$

Directly from the constraints, we see that the $y_i^z$ have the interpretation of
"mixing coefficients" for the various alternatives in state $i$ times the probability
of being in that state; that is, if a pure policy were used, $y_i^z$ equals the
probability of being in state $i$ for some $z = z^*(i)$, and equals zero for all other
alternatives available in that state.

Then, since $\rho_i^z / \nu_i^z$ is the rate at which the reward is earned when in state $i$,
following alternative $z$, the maximand is just the average rate at which reward is
earned, at an arbitrary transition of the process, as it should be.
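The interpretation of the undiscounted dual can be checked numerically for a fixed pure policy. The following Python sketch is only an illustration: the two-state data (`P`, `nu`, `rho`) and the helper names `solve` and `stationary` are hypothetical and do not come from [11]. It assumes that, under a pure policy, the dual variable for state $i$ is the long-run fraction of time spent in that state, that is, the stationary probability of the imbedded chain weighted by the mean holding time.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot_row = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot_row] = M[pivot_row], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

def stationary(P):
    """Stationary probabilities of the imbedded chain: pi P = pi, sum(pi) = 1."""
    n = len(P)
    # Balance equations for the first n-1 states, then the normalization row.
    A = [[(1 if i == j else 0) - P[i][j] for i in range(n)] for j in range(n - 1)]
    A.append([F(1)] * n)
    b = [F(0)] * (n - 1) + [F(1)]
    return solve(A, b)

# Hypothetical data for one fixed pure policy (illustration only).
P = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]   # transition probabilities p_ij
nu = [F(2), F(1)]                              # mean holding times nu_i
rho = [F(4), F(1)]                             # expected reward per transition rho_i

pi = stationary(P)
mean_interval = sum(p * t for p, t in zip(pi, nu))
# Assumed reading of the dual variables: fraction of time spent in each state.
y = [p * t / mean_interval for p, t in zip(pi, nu)]
# Dual maximand: time-weighted average of the reward rates rho_i / nu_i.
gain = sum(yi * r / t for yi, r, t in zip(y, rho, nu))
# The same gain from the usual ratio of mean reward to mean interval length.
gain_ratio = sum(p * r for p, r in zip(pi, rho)) / mean_interval
```

With these numbers the two expressions for the gain agree, which is the sense in which the maximand is "the average rate at which reward is earned."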


In the case of discounted reward, the optimal policy simultaneously maximizes
the total discounted reward, starting in state $i$, $v_i(\alpha)$, for all states $i$.
Thus, one can take any arbitrary set of initial starting probabilities, $a_i$
$(i = 1, 2, \ldots, N)$, and formulate the primal as:

$$
\begin{aligned}
\text{Minimize} \quad & \sum_{j=1}^{N} a_j x_j \\
\text{Subject to:} \quad & \sum_{j=1}^{N} \bigl(\delta_{ij} - p_{ij}^z f_{ij}^z(\alpha)\bigr)\, x_j \ge \rho_i^z && (i = 1, 2, \ldots, N) \\
& x_j \ \text{unrestricted} && (j = 1, 2, \ldots, N)
\end{aligned}
\tag{5}
$$

with the optimal values of $x_i$ being the maximal values of the $v_i(\alpha)$. (This is
a slight generalization of (111) of [10].) Directly, the dual of (5) is:


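$$
\begin{aligned}
\text{Maximize} \quad & \sum_i \sum_z y_i^z \rho_i^z \\
\text{Subject to:} \quad & \sum_z y_j^z = a_j + \sum_i \sum_z y_i^z\, p_{ij}^z f_{ij}^z(\alpha) && (j = 1, 2, \ldots, N) \\
& y_i^z \ge 0 && (i = 1, 2, \ldots, N;\ \text{all } z)
\end{aligned}
\tag{6}
$$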

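To see the correct interpretation of (6), define

(7)  $M_{ij}(t;\alpha)$ = mean discounted number of entries into state $j$
     in $[0,t]$, starting in state $i$.

From first principles, or the undiscounted result (C.6) of MRP II (where
$M_{ij}(t)$ did not include the event at the origin):

$$
M_{ij}(t;\alpha) = \delta_{ij} + \sum_k \int_0^t e^{-\alpha x}\, M_{kj}(t - x;\alpha)\, dQ_{ik}(x) \tag{8}
$$

In the limit $t \to \infty$, clearly $M_{ij}(t;\alpha)$ tends to a limit $M_{ij}^{\alpha}$, and

$$
M_{ij}^{\alpha} = \delta_{ij} + \sum_k p_{ik} f_{ik}(\alpha)\, M_{kj}^{\alpha}
               = \delta_{ij} + \sum_k M_{ik}^{\alpha}\, p_{kj} f_{kj}(\alpha) \tag{9}
$$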


The transposition follows from the well-known commutativity of the transforms of
the matrices $Q_{ij}(t)$ and $M_{ij}(t)$ (undiscounted) [12]. If (9) were multiplied by
the $a_i$ and summed over $i$, an expression for the mean discounted number of visits to
state $j$, under the starting conditions of (5), would be obtained. Comparing this with
(6), we see that the dual variables $y_i^z$ are just the appropriate alternative mixing
probabilities times the discounted number of visits to state $i$; from this, it
follows that the dual maximand in (6) is the correct value of total discounted
return.  For  recent  work  on  programming  models,  see  [5],  r5],  and  [8], 
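The relation between the discounted visit counts and the dual variables can be verified on a small example. The Python sketch below is illustrative only: the matrix `Q` (standing for the products $p_{ij} f_{ij}(\alpha)$ under one fixed pure policy), the rewards `rho`, the starting probabilities `a`, and the helper `solve` are all hypothetical. It solves the limiting equations for the discounted returns and for the discounted visit counts weighted by the starting probabilities, and compares the two objective values.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot_row = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot_row] = M[pivot_row], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

# Hypothetical data for one fixed pure policy (illustration only).
Q = [[F(1, 4), F(1, 4)], [F(1, 2), F(0)]]  # q_ij = p_ij * f_ij(alpha), row sums < 1
rho = [F(1), F(2)]                         # discounted one-step rewards
a = [F(1, 2), F(1, 2)]                     # starting probabilities a_i
n = len(Q)

I_minus_Q = [[(1 if i == j else 0) - Q[i][j] for j in range(n)] for i in range(n)]
I_minus_Q_T = [list(col) for col in zip(*I_minus_Q)]

# Discounted returns: v_i = rho_i + sum_j q_ij v_j.
v = solve(I_minus_Q, rho)
# Dual variables for a pure policy: y_j = a_j + sum_i y_i q_ij.
y = solve(I_minus_Q_T, a)

# Column j of the visit-count matrix solves M_ij = delta_ij + sum_k q_ik M_kj.
M_cols = [solve(I_minus_Q, [F(1 if i == j else 0) for i in range(n)]) for j in range(n)]
# Starting-probability-weighted visits to state j: sum_i a_i M_ij.
visits = [sum(a[i] * M_cols[j][i] for i in range(n)) for j in range(n)]

primal_value = sum(a[i] * v[i] for i in range(n))
dual_value = sum(y[j] * rho[j] for j in range(n))
```

Here `visits` coincides with `y`, and the two objective values are equal, in line with the reading of the dual variables as discounted visit counts.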

By straightforward use of the theory of linear programming, and the special
structure of the constraint matrix in (4) and (6), it follows that pure strategies
are optimal for both problems, a result first noted by Wagner [16] for Markov
programs. In fact, the "policy improvement routine" is nothing more than a special
version of the (dual) simplex method, in which simultaneous changes of several basis
vectors (a basis is a selection of pure alternatives) are possible at each iteration,
a fact noted by Oliver [14] and de Ghellinck [7]; no Phase I is needed, since any
pure strategy is (dual) feasible. Schweitzer [15] has shown that it is the convention:

"If there is no improvement in the test quantity from the last cycle,
retain the same alternative"

in the policy-improvement algorithm which avoids cycling in the simplex method when
there is degeneracy (ties among pure policies). Thus the algorithm is always finite.

In our discussion about alternative test criteria for the undiscounted case
(Appendix D, MRP II), both Schweitzer and the author missed the "pricing-in"
criterion which arises naturally from the linear programming models. It follows
directly from (1) and (5) above that policy improvement will always occur if the
following rules are adopted for the algorithms of Figure 1, MRP I, and Figure 2,
MRP II:

1.  (Discounted, Infinite-Horizon Case): For at least one state $i$,
select a new alternative $z(i)$ for which

$$
v_i < \rho_i^z + \sum_j p_{ij}^z f_{ij}^z(\alpha)\, v_j , \tag{10}
$$

using the current values of the discounted returns, $v_i(\alpha)$.

2.  (Undiscounted, Infinite-Horizon Case): For at least one state
$i$, select an alternative $z(i)$ for which

$$
v_i + g \nu_i^z < \rho_i^z + \sum_j p_{ij}^z v_j , \tag{11}
$$

using the current values of the gain, $g$, and the relative values, $v_i$.
Thus, both criteria (D.1) and (D.2) of MRP II are merely different ways of ranking
prospective  candidates  to  enter  the  basis  at  the  next  iteration,  and  the  question  of 
relative  efficiency  becomes  undecidable  without  analysis  of  the  computational  labor 
required  and  experimental  tests.  Similar  remarks  apply  to  the  question  of  rate  of 
convergence  if  several  candidates  are  placed  in  the  basis  simultaneously. 
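A minimal sketch of the undiscounted iteration with the pricing-in test may make the rule concrete. Everything in the Python example below is an assumption made for illustration: the two-state, two-alternative data (`P`, `nu`, `rho`), the function names, and the particular ranking used (take the alternative whose constraint is most violated, which is only one of the possible ways of choosing an entering candidate). The strict inequality in the improvement step implements the convention of retaining the current alternative when there is no improvement.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot_row = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot_row] = M[pivot_row], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

def evaluate(policy, P, nu, rho):
    """Value determination: v_i + g*nu_i = rho_i + sum_j p_ij v_j, last v set to 0."""
    n = len(policy)
    A, b = [], []
    for i, z in enumerate(policy):
        row = [(1 if i == j else 0) - P[i][z][j] for j in range(n - 1)]
        row.append(nu[i][z])          # coefficient of the gain g
        A.append(row)
        b.append(rho[i][z])
    sol = solve(A, b)
    return sol[:-1] + [F(0)], sol[-1]

def slack(i, z, v, g, P, nu, rho):
    """Positive when alternative z in state i satisfies the test inequality."""
    return rho[i][z] + sum(p * vj for p, vj in zip(P[i][z], v)) - v[i] - g * nu[i][z]

def policy_iteration(P, nu, rho):
    policy = [0] * len(P)
    while True:
        v, g = evaluate(policy, P, nu, rho)
        new_policy = []
        for i, current in enumerate(policy):
            best, best_slack = current, F(0)
            for z in range(len(P[i])):
                s = slack(i, z, v, g, P, nu, rho)
                if s > best_slack:    # strict: on a tie keep the current alternative
                    best, best_slack = z, s
            new_policy.append(best)
        if new_policy == policy:
            return policy, g, v
        policy = new_policy

# Hypothetical data, indexed as [state][alternative] (illustration only).
P = [[[F(1, 2), F(1, 2)], [F(0), F(1)]],
     [[F(1), F(0)], [F(1, 2), F(1, 2)]]]
nu = [[F(1), F(2)], [F(1), F(1)]]
rho = [[F(1), F(5)], [F(0), F(1)]]

policy, g, v = policy_iteration(P, nu, rho)
```

At termination no alternative has positive slack, so the final gain and relative values satisfy every constraint of the primal program (1) for this example.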

To  summarize,  the  policy  improvement  algorithm  of  Markov  and  Markov-renewal 
programming  is  simultaneously  a  dynamic  programming  algorithm  and  a  simplex 
algorithm. 

TIES  AND  ABSOLUTE  VALUES  OF  THE  BIAS  TERMS 


Blackwell  [1]  remarked  that  the  relative  values  in  Markov  programming  were 
insufficient  to  break  ties  among  policies  with  the  same  gains  and  produced  an 
explicit formula for the absolute values of the RHS of (3), in terms of the
"fundamental matrix" of Markov chains. The formula established for the absolute
values in (B.6) of MRP II for Markov-renewal programs was:


$$
w_i = \lim_{t \to \infty} \bigl[ v_i(t) - gt \bigr] = \sum_j \omega_{ij} \rho_j - \sum_j \frac{\eta_j}{\mu_{jj}} \tag{12}
$$


where  the  notation  is  the  same  as  in  MRP  II,  except  for  the  new  definition: 


(13)  [definition of $\omega_{ij}$; the equation is not recoverable from the source]

(The above limit may be in a Cesàro sense.)


This formula, involving the first and second moments of the first-passage distribution,
is too complicated for rapid computation of the $w_i$; similar remarks apply
to a reduction of (B.6) by Schweitzer to a form involving the fundamental matrix
(Equation (5.112) of [15]), and to remarks by Fox [5].

However,  from  basic  definitions,  and  (C.8)  of  MRP  II,  we  have: 


(14)  [equation not recoverable from the source]

where $\nu^{(2)}$ is the second moment of an average transition interval. Then, from
(12), we have the remarkable formula:


(15)  [equation not recoverable from the source]

(This  result  was  first  obtained  in  [12].) 


Thus, once the stationary probabilities are known, the normalizing factor to
change the relative values $v_i$ to the absolute values $w_i$ follows directly. If
the inverse of the matrix used in the value-determination part of the algorithm is
saved, then the stationary probabilities may be determined from the row of the
inverse corresponding to the variable $g$.

Although  the  above  remarks  do  not  make  tie-breaking  a  trivial  procedure,  they 
do  indicate  that  at  most  one  need  only  carry  out  the  value-determination  procedure 
for  every  tying  policy. 

